Fix: read file encoding #3554
Conversation
Hey @aheizi, thank you for your contribution. I apologize that it took us so long to review your PR.
Looking at the whole flow, it seems like we're doing encoding detection twice: once with chardet and then again with your custom logic. Could we simplify this to use only chardet's result?
Thank you again for your contribution and patience; I'm looking forward to getting this PR ready for review.
I noticed this alwaysTextExtensions array is also defined in extract-text.ts but with a different format (dots vs no dots). Should we maybe centralize this list somewhere to avoid duplication?
OK, it has been extracted as a public constant.
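For illustration, a shared constant could look something like this (the module path and the exact extension list are hypothetical, not the ones used in the PR):

```ts
// shared/text-extensions.ts (hypothetical location)
// Stored without the leading dot; callers that need the dotted form can map over it.
export const ALWAYS_TEXT_EXTENSIONS = ["txt", "md", "json", "ts", "js"] as const

// Dotted variant for callers such as extract-text.ts that compare against ".ext".
export const ALWAYS_TEXT_EXTENSIONS_WITH_DOTS = ALWAYS_TEXT_EXTENSIONS.map((ext) => `.${ext}`)
```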
Should we maybe log these errors for debugging? Silent failures could make it hard to troubleshoot encoding issues later.
This catch block handles failures that occur while trying to decode the file with different encodings. That is an expected situation rather than a serious error: the design tries multiple encodings until the best match is found, so it is normal for some of them to fail.
If logging is desired, a debug log could be added, but I haven't found anywhere in the project where debug logging is currently used.
For large files, wouldn't decoding the entire buffer multiple times be slow? Have you considered just trusting chardet's detection and only falling back if that fails?
Great point: decoding large buffers multiple times is definitely something to watch for performance-wise.
You're absolutely right that for large files it makes sense to avoid trying multiple encodings up front. One improvement I'm planning is to set a file size threshold (e.g. 1 MB):
- For small files, we keep the current logic: try several likely encodings and score the results.
- For large files, we first trust chardet and decode using its result. Only if the decoded content looks suspicious (e.g. a low score or unreadable characters) do we fall back to trying a few alternatives like gb18030.
This way we preserve accuracy for tricky cases (e.g. GBK-encoded .js or .txt files that start with mostly ASCII) while avoiding unnecessary work for large files, where chardet is usually good enough.
Happy to push this change if it sounds good to you; a rough sketch of the idea follows.
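A rough sketch of that plan, assuming the jschardet and iconv-lite dependencies added by this PR (the threshold, the looksSuspicious heuristic, and the function names are illustrative; the small-file path just stands in for the existing scoring logic):

```ts
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Illustrative cut-off: below this, trying several encodings is cheap enough.
const LARGE_FILE_THRESHOLD = 1024 * 1024 // 1 MB

// Cheap sanity check on a decoded result: many U+FFFD replacement characters
// usually mean the encoding guess was wrong.
function looksSuspicious(text: string): boolean {
  const bad = (text.match(/\uFFFD/g) ?? []).length
  return text.length > 0 && bad / text.length > 0.01
}

// Large-file path: trust jschardet, decode once, and only fall back to an
// alternative such as gb18030 when the result looks broken.
function decodeLargeFile(buffer: Buffer): string {
  const detected = jschardet.detect(buffer).encoding
  const encoding = detected && iconv.encodingExists(detected) ? detected : "utf-8"
  const text = iconv.decode(buffer, encoding)
  if (!looksSuspicious(text)) return text
  const fallback = iconv.decode(buffer, "gb18030")
  return looksSuspicious(fallback) ? text : fallback
}

// decodeSmall is the existing "try several encodings and score them" path.
function decodeBuffer(buffer: Buffer, decodeSmall: (b: Buffer) => string): string {
  return buffer.length > LARGE_FILE_THRESHOLD ? decodeLargeFile(buffer) : decodeSmall(buffer)
}
```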
How did you arrive at the 0.05 threshold? Have you tested this with different types of files to see if this value works well across various scenarios?
I'm curious about this custom scoring system; would just trusting chardet's detection be enough in this case?
The scoring system was introduced as a safeguard against misdetections from chardet, especially in East Asian contexts. In practice, chardet often misclassifies GBK/GB18030 files as UTF-8 if the text is mostly ASCII (e.g. source code with only occasional Chinese comments). A simple confidence score from chardet doesn’t always reflect actual readability.
The scoreText function favors encodings that produce a reasonable amount of Chinese or full-width characters, and penalizes pure ASCII results. The 0.05 threshold came from empirical testing across a mix of file types:
- UTF-8 Chinese content typically scores around 0.2–0.6
- GBK files decoded incorrectly as UTF-8 usually get negative or near-zero scores
- Pure ASCII text tends to score around -1
Hi @daniel-lxs, thank you for taking the time to review this PR!
You're right: the flow detects the encoding with chardet and then tries multiple candidate encodings, including the one chardet suggested. The reason is that chardet's detection can often be unreliable, especially for short or ambiguous files (e.g. GBK-encoded .js or .txt files that contain mostly ASCII). In such cases, decoding only with chardet's top guess can lead to misinterpretation or mojibake.
The secondary scoring pass (via scoreText) helps us choose the most plausible decoding result among a few likely encodings, particularly prioritizing those common in Chinese and other East Asian contexts (utf-8, gb18030, shift_jis, etc.).
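To make the last two comments concrete, here is a condensed sketch of a scoring pass of this kind. It is an illustration of the idea only, not the PR's actual scoreText implementation; the character ranges, weights, and the pickBestDecoding helper are assumptions:

```ts
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Hypothetical scoring: reward CJK and full-width characters, penalize
// replacement characters, and give pure-ASCII results a low score.
function scoreText(text: string): number {
  if (text.length === 0) return -1
  const cjk = (text.match(/[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]/g) ?? []).length
  const bad = (text.match(/\uFFFD/g) ?? []).length
  if (cjk === 0 && bad === 0) return -1 // pure ASCII
  return (cjk - 5 * bad) / text.length
}

const SCORE_THRESHOLD = 0.05 // the empirically chosen value discussed above

function pickBestDecoding(buffer: Buffer): string {
  // Candidate list: chardet's guess plus encodings common in East Asian text.
  const detected = jschardet.detect(buffer).encoding
  const candidates = [...new Set([detected, "utf-8", "gb18030", "shift_jis"])].filter(
    (enc): enc is string => !!enc && iconv.encodingExists(enc),
  )

  let best = { text: buffer.toString("utf-8"), score: -Infinity }
  for (const enc of candidates) {
    const text = iconv.decode(buffer, enc)
    const score = scoreText(text)
    if (score > best.score) best = { text, score }
  }
  // If nothing clears the threshold, fall back to plain UTF-8.
  return best.score >= SCORE_THRESHOLD ? best.text : buffer.toString("utf-8")
}
```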
Hi @aheizi, thanks for your work on fixing the file encoding issues. This is an important area to get right.
While the current approach might work for the cases you've tested, relying on many custom rules like this can make the solution complex and potentially unreliable as we encounter different files or new situations in the future. It can also make the code harder to understand and maintain. We need to find a simpler and more robust way to handle file encodings, and there are a few alternatives worth exploring.
The aim is to have a solution that is dependable and easier to maintain, rather than a complex system of custom checks. A clear method for common encodings with a well-defined, simple fallback is generally preferred. Could you explore these simpler approaches?
Hi @daniel-lxs, thanks a lot for your thoughtful feedback; I really appreciate it.
Regarding your suggestions: given those limitations, I opted for the current approach, which is admittedly more complex but has worked reliably in the cases I tested. It includes some heuristics and fallback logic that aim to cover common scenarios while still falling back to UTF-8 if detection fails.
That said, I completely agree with the goal of simplifying this logic. If we can find a more robust and maintainable way to handle encoding detection, especially one that avoids custom heuristics, I'm absolutely open to revisiting the current implementation. For now, I think this version is a step forward in terms of correctness and can serve as a foundation we can refine.
Thanks again for the suggestions; I'd love to keep the discussion going if you have any further ideas.
@daniel-lxs In the latest submission, I referred to the encoding implementation in VS Code: https://github.com/microsoft/vscode/blob/main/src/vs/workbench/services/textfile/common/encoding.ts
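For context, that style of detection starts with BOM sniffing before any statistical guessing. This is a minimal sketch of that first step only, not the actual VS Code code:

```ts
// Byte order marks for the Unicode encodings commonly special-cased.
const UTF8_BOM = [0xef, 0xbb, 0xbf]
const UTF16_BE_BOM = [0xfe, 0xff]
const UTF16_LE_BOM = [0xff, 0xfe]

// Returns the encoding implied by a BOM, or null if none is present,
// in which case a detector such as jschardet would be consulted next.
function detectBom(buffer: Buffer): "utf-8" | "utf-16be" | "utf-16le" | null {
  const startsWith = (bom: number[]) =>
    buffer.length >= bom.length && bom.every((byte, i) => buffer[i] === byte)
  if (startsWith(UTF8_BOM)) return "utf-8"
  if (startsWith(UTF16_BE_BOM)) return "utf-16be"
  if (startsWith(UTF16_LE_BOM)) return "utf-16le"
  return null
}
```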
Consider using the built-in buffer.toString("latin1") instead of manually looping over the buffer in encodeLatin1 for improved performance and clarity.
This relates to the following comment in VS Code's source:
// before guessing jschardet calls toString('binary') on input if it is a Buffer,
// since we are using it inside browser environment as well we do conversion ourselves
// https://github.com/aadsm/jschardet/blob/v2.1.1/src/index.js#L36-L40
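For anyone reading along, a rough illustration of the trade-off being discussed (not the PR's exact code): jschardet expects a binary string, and a manual loop keeps the conversion portable to environments where Buffer is unavailable, while buffer.toString("latin1") is the shortcut available in Node:

```ts
// Manual byte-to-string conversion that also works on a plain Uint8Array
// in a browser environment; each byte maps 1:1 to U+0000..U+00FF.
function encodeLatin1(bytes: Uint8Array): string {
  let result = ""
  for (let i = 0; i < bytes.length; i++) {
    result += String.fromCharCode(bytes[i])
  }
  return result
}

// Node-only equivalent suggested in the review:
// const binaryString = buffer.toString("latin1")
```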
Hey @aheizi, thank you for taking the time to tackle this issue. Unfortunately, I don't think the implementation aligns with our goals: it adds quite a complex rating system to detect the encoding, and that complexity makes it hard to test and maintain. This doesn't mean your implementation is bad or that the issue is not important; we just want a simpler solution if one exists. I'll close this PR, but feel free to continue the discussion, and I'll gladly answer any questions you might have. Thank you again!
Co-authored-by: wangyj20 <[email protected]>
Related GitHub Issue
#3555
Description
Currently, roo-code reads files as UTF-8 by default. When a file is encoded as GBK (or another non-UTF-8 encoding), this produces garbled text.
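As a minimal illustration of the problem and of what decoding with the correct encoding yields (assuming the iconv-lite dependency added by this PR; the byte values are just an example):

```ts
import * as iconv from "iconv-lite"

// "你好" encoded as GBK is the byte sequence C4 E3 BA C3.
const gbkBytes = Buffer.from([0xc4, 0xe3, 0xba, 0xc3])

console.log(gbkBytes.toString("utf-8"))        // garbled: invalid UTF-8 bytes become U+FFFD
console.log(iconv.decode(gbkBytes, "gb18030")) // "你好"
```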
Test Procedure
Manual testing.
Type of Change
Changes are limited to `src` or test files.
Pre-Submission Checklist
- Linting passes (`npm run lint`).
- Debugging code (e.g. `console.log`) has been removed.
- Tests pass (`npm test`).
- The branch is up to date with the `main` branch.
- `npm run changeset` has been run if this PR includes user-facing changes or dependency updates.
Screenshots / Videos
before:

after:
Documentation Updates
Does this PR necessitate updates to user-facing documentation?
Additional Notes
Important
Introduces `readFileWithEncoding` to handle multiple file encodings, replacing `fs.readFile` in key tools to prevent garbled text issues, and adds the necessary dependencies.
- Adds `readFileWithEncoding` in `readFileWithEncoding.ts` to handle multiple file encodings, including UTF-8, UTF-16, and GBK.
- Replaces `fs.readFile` with `readFileWithEncoding` in `applyDiffTool.ts`, `insertContentTool.ts`, `searchAndReplaceTool.ts`, and `DiffViewProvider.ts` to prevent garbled text issues.
- Adds `iconv-lite` and `jschardet` to `package.json` for encoding detection and conversion.
- Updates `extract-text.ts` to use `readFileWithEncoding` for non-binary files.